Enable static quantization for Qwen3-0.6B decoder (transformer-only) #836
DingmaomaoBJTU
left a comment
Summary - structurally sound export, but registration/test/quant integration don't match repo conventions, and w8a16 accuracy regresses.
Nice work getting a fused GQA + LpNorm RMSNorm + 1x1-Conv transformer-only export running end-to-end on QNN, and the export itself is faithful - the FP optimized graph reproduces HF eager's next-token exactly. Three things to address before this is review-ready:
1. Registration is non-standard (highest priority). qwen_transformer_only.install() hot-patches the global registries at runtime and isn't imported by models/hf/__init__.py. Every other model registers declaratively at import time (@register_onnx_overwrite / @register_composite_model, merged in __init__.py). Please make this a first-class variant (distinct task/model_type or a build-config flag) instead of monkey-patching; it also removes the "must call install() before importing the composite machinery" ordering trap and the no-way-back override of the eager path.
2. Test & quant entry points violate repo layout. test_qwen.py and qwen3_transformer_only_quantize.py are standalone scripts at the repo root; test_qwen.py is a subprocess driver that judges success by artifact mtime and uses os._exit(0) to mask a native QNN/ORT teardown crash. Convention (tests/CLAUDE.md) is pytest under tests/. Move the runner to tests/e2e/ (or examples/), and wire the calibration reader into the config-driven quant flow (WinMLBuildConfig.quant) rather than a bespoke quantizer.
3. w8a16 accuracy is not yet acceptable. Measured against the FP graph on the same GSM8K-style input, the quantized model flips the top-1 next token on both prefill and decode (top-5 overlap 0-1/5, KL 0.66/2.75; hidden-state cosine 0.64-0.72), while present-KV stays ~0.999 - i.e. the residual stream is the casualty. Likely minmax + all-zero KV calibration + only 30 samples. Please try percentile/entropy calibration with a realistic non-zero KV feed and report an actual task metric, not just QDQ node count.
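For reference, the comparison in point 3 can be reproduced with a few lines. This is a minimal sketch, assuming the FP and quantized logits (or hidden states) for one position are available as plain lists of floats; the function names are illustrative and not part of the repo, and the KL direction (FP relative to quantized) is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a 1-D list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_overlap(ref_logits, test_logits, k=5):
    """Count the token ids shared by the two top-k sets (0..k)."""
    def top(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])
    return len(top(ref_logits) & top(test_logits))


def kl_divergence(ref_logits, test_logits, eps=1e-12):
    """KL(ref || test) in nats between the two next-token distributions."""
    p, q = softmax(ref_logits), softmax(test_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Reporting these per position for both prefill and decode, next to a task-level metric, makes regressions in the residual stream visible even when present-KV looks healthy.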
Naming and the custom-op export pattern look good and match the codebase.
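To illustrate point 1, here is a minimal sketch of declarative, import-time registration. `VARIANT_REGISTRY` and `register_variant` are hypothetical stand-ins, not the repo's `@register_composite_model` or its registries, whose signatures are not shown in this thread.

```python
from typing import Callable, Dict

# Hypothetical registry, keyed by model_type. Populated as a side effect of
# importing the module that defines each class, never patched at runtime.
VARIANT_REGISTRY: Dict[str, type] = {}


def register_variant(model_type: str) -> Callable[[type], type]:
    """Class decorator that records a model class under a distinct model_type."""
    def decorator(cls: type) -> type:
        if model_type in VARIANT_REGISTRY:
            # Refuse silent overrides; this is the "no way back" problem
            # that a runtime install() hot-patch introduces.
            raise ValueError(f"{model_type!r} is already registered")
        VARIANT_REGISTRY[model_type] = cls
        return cls
    return decorator


@register_variant("qwen3")
class Qwen3Eager:
    """Stand-in for the native eager-path model."""


@register_variant("qwen3_transformer_only")
class Qwen3TransformerOnly:
    """Stand-in for the transformer-only variant."""
```

Because both entries coexist under different keys, selecting the variant becomes a lookup on `model_type` rather than an ordering requirement on when `install()` was called.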
DingmaomaoBJTU
left a comment
Code Review — PR #836 (Draft)
Well-structured PR. The transformer-only export topology (fused GQA, LpNorm RMSNorm, 1x1 Conv), GSM8K calibration pipeline, and model_type override mechanism are solid. A few correctness bugs and infrastructure concerns should be resolved before marking ready for merge.
Not approving since this is a draft PR.
Replace the standalone root-level quant driver and __main__/subprocess test runner with the regular build pipeline and pytest.
- Move calibration logic into src/.../hf/qwen_transformer_only_quant.py; the decode wrapper exposes winml_finalize_quant_config, invoked generically from build/hf.py just before quantize_onnx. The build now quantizes via precision=w8a16 + config.quant instead of a separate script.
- The hook reads seq_len / max_cache / GQA node names from the exported ONNX and selects the prefill vs decode-trajectory calibration reader, keeping the verified-good scheme (int8-symmetric weights, uint16 activations, minmax, GQA excluded from QDQ).
- Delete root qwen3_transformer_only_quantize.py and test_qwen.py.
- Add tests/unit/models/qwen_transformer_only (fast, offline) and tests/e2e/models/test_qwen3_transformer_only_quant.py (build+quant+decode-parity, QNN-gated NPU).
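The arithmetic behind the w8a16 scheme named above (int8-symmetric weights, uint16 activations with minmax ranges) can be sketched as follows. This is the textbook affine-quantization formulation, not the quantizer's actual code; the function names and the choice of a [-127, 127] symmetric range are assumptions.

```python
def symmetric_int8_params(values):
    """Symmetric scheme: zero point fixed at 0, scale taken from max |value|."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    return scale, 0


def minmax_uint16_params(values):
    """Asymmetric minmax scheme: map [min, max] (extended to include 0) onto 0..65535."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 65535 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(value, scale, zero_point, qmin, qmax):
    """Round to the nearest integer step, shift by the zero point, then clamp."""
    q = round(value / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Inverse mapping back to a real value."""
    return (q - zero_point) * scale
```

Under minmax, a single outlier in the calibration data stretches `hi - lo` and therefore the step size for every other activation, which is why the number and realism of calibration samples matters.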
# Conflicts:
#   src/winml/modelkit/loader/config.py
#   src/winml/modelkit/models/auto.py
…c shapes
- Add missing docstrings / return-type annotations and drop dead noqa directives across qwen3_export_ops.py, qwen3_modeling.py and the transformer-only registration so 'ruff check src/ tests/' (CI lint) passes.
- build/hf.py: re-persist config.json after winml_finalize_quant_config runs, so the saved metadata reflects the actually-applied w8a16 scheme (int8/uint16/symmetry + GQA nodes_to_exclude) rather than the pre-finalize policy dtypes.
- qwen_transformer_only_quant._graph_shapes: treat a non-positive dim_value (symbolic/dynamic axis) as a hard error instead of silently returning a zero-length shape.
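The stricter shape handling can be sketched like this. It is an illustration only, not the body of _graph_shapes: `static_shape` is a hypothetical helper that relies on the `dim_value` and `dim_param` fields of an ONNX shape dimension, and `SimpleNamespace` objects stand in for those dimensions so the sketch runs without onnx installed.

```python
from types import SimpleNamespace


def static_shape(dims):
    """Return concrete ints for every axis, or raise on a symbolic/dynamic one."""
    shape = []
    for index, dim in enumerate(dims):
        value = getattr(dim, "dim_value", 0)
        if value <= 0:
            # An unset dim_value reads as 0 in ONNX; dim_param holds the
            # symbolic name when the axis is dynamic.
            name = getattr(dim, "dim_param", "") or "<unnamed>"
            raise ValueError(
                f"axis {index} is symbolic or dynamic ({name}); "
                "a static shape is required"
            )
        shape.append(int(value))
    return shape


# Stand-ins for the dims of a graph input; with onnx these would come from
# value_info.type.tensor_type.shape.dim.
prefill_dims = [
    SimpleNamespace(dim_value=1, dim_param=""),
    SimpleNamespace(dim_value=64, dim_param=""),
]
print(static_shape(prefill_dims))
```

Failing loudly here keeps a zero-length calibration tensor from reaching the quantizer unnoticed.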
…2e helper)
- LpNormOnnxExport.forward now computes the real L2 normalization instead of a silent identity; export-invariant (node comes from symbolic) and correct in eager.
- GroupQueryAttentionOnnxExport.forward keeps the non-raising placeholder, with a docstring explaining why raising is impossible (HTP hierarchy capture runs an eager forward outside trace/export).
- Remove unused module-level logger in qwen_transformer_only.py (CodeQL).
- Use a single onnx import form in test_quant_calibration.py (CodeQL).
- Fix e2e _decoder_onnx_path helper to handle the single-model WinMLModelForGenericTask (.onnx_path) build, not just composite .sub_models.
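Why an L2 normalization can stand in for RMSNorm: with hidden size d, mean(x^2) = ||x||^2 / d, so x / sqrt(mean(x^2)) = sqrt(d) * x / ||x||. A minimal numeric check follows, assuming eps = 0 and plain Python lists; where the export actually folds the sqrt(d) factor and how it handles eps are not stated in this thread, so the last function is only one possible arrangement.

```python
import math


def l2_normalize(x):
    """x / ||x||_2, the operation an LpNormalization node with p=2 performs."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def rms_norm(x, weight, eps=0.0):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def rms_norm_via_l2(x, weight):
    """Same result for eps = 0, with sqrt(d) applied alongside the weight."""
    scale = math.sqrt(len(x))
    return [w * scale * v for v, w in zip(l2_normalize(x), weight)]
```

An identity forward would skip the division entirely, so any eager-mode pass through the op would see unnormalized activations.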
…_type-override test
- build_hf_model: look up winml_finalize_quant_config on type(pytorch_model) instead of the instance, and call it with explicit self. Fixes the mypy 'Tensor not callable' error (getattr yields Any) and stops the hook firing on raw HF models / MagicMock test doubles (whose attributes are instance-synthesized), which was serializing a MagicMock into config.json.
- test_resolve_loader_config: replace the obsolete 'never mutated' test with one asserting the intended explicit-model_type override (needed for variants like qwen3_transformer_only).
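The class-level lookup can be demonstrated with the standard library alone. This is a minimal sketch: `find_hook`, `Wrapper` and `RawModel` are hypothetical, and the hook's signature (a config dict in, a config dict out) is an assumption made for the example.

```python
from unittest.mock import MagicMock

HOOK = "winml_finalize_quant_config"


def find_hook(model):
    """Return the hook only if the model's class defines it."""
    # Looking on the type skips attributes that exist only because the
    # instance synthesizes them on access, as MagicMock does.
    hook = getattr(type(model), HOOK, None)
    return hook if callable(hook) else None


class Wrapper:
    """Stand-in for an export wrapper that opts in to the hook."""

    def winml_finalize_quant_config(self, config):
        return {**config, "finalized": True}


class RawModel:
    """Stand-in for a raw model without the hook."""


wrapper = Wrapper()
hook = find_hook(wrapper)
# The function was fetched from the class, so self is passed explicitly.
print(hook(wrapper, {"precision": "w8a16"}))
print(find_hook(RawModel()), find_hook(MagicMock()))
```

An instance-level `getattr(MagicMock(), HOOK)` would instead return a callable mock, which is how a MagicMock could end up in the serialized config.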
Relocate the model-specific transformer-only calibration/quant logic out of
models/hf (an export-only package) into a new quant/calibration/ subpackage,
dispatched via a model_type-keyed registry that mirrors COMPOSITE_MODEL_REGISTRY.
- Add quant/calibration/{base,registry}.py: QuantConfigFinalizer protocol +
register_quant_finalizer / get_quant_finalizer (lazy, torch-free import).
- git mv qwen_transformer_only_quant.py -> quant/calibration/qwen3_transformer_only.py
and register Qwen3TransformerOnlyQuantFinalizer for 'qwen3_transformer_only'.
- build/hf.py: replace the winml_finalize_quant_config wrapper hook with explicit
registry dispatch keyed on config.model_type; unregistered types fall back to
the default DatasetCalibrationReader. Preserve the model_id/_name_or_path
fallback (now model-agnostic in the build layer).
- Remove the hook from the export wrapper (back to export-only).
- Relocate unit tests to tests/unit/quant/calibration/ and add test_registry.py.
w8a16 scheme unchanged; CPU e2e (quantized-graph + GQA-exclusion + FP-parity)
and 86 build/quant unit tests pass.
- annotate register_quant_finalizer return type (mypy no-untyped-def)
- add TYPE_CHECKING re-imports so static analyzers see lazy __all__ exports (CodeQL py/undefined-export)
- drop bare ... from finalizer Protocol; docstring is the body (CodeQL ineffectual-statement)
…transformer-only into subpackage
The CLI-only _build_hf_pipeline did not pass loader.model_type to _load_model, so a config requesting qwen3_transformer_only was silently loaded as native qwen3 and crashed at export (embedding got HalfTensor). It also skipped the model-type quant finalizer, producing the default uint8/uint16 minmax scheme instead of the registered int8-sym / GQA-excluded policy. Both gaps existed only in the CLI path; the library build_hf_model already handled them. Mirror that logic so winml build produces the verified w8a16 graph (985 Q / 1294 DQ / 28 GQA / 0 QDQ-touching-GQA) end-to-end.
Also move qwen3_export_ops, qwen3_modeling and qwen_transformer_only into a models/hf/qwen3/ subpackage and add regression tests for both fixes.
…rt script
Reparent WinMLQwen3TransformerOnlyModel from WinMLDecoderOnlyModel to the plain WinMLCompositeModel. The decoder-only base wires a generation runtime from the eager KV signature (past_0_key) in __init__, which the transformer-only graph (past_keys_0 + symbolic axes) lacks, so from_pretrained crashed while constructing the handle even though both sub-model ONNX built fine. The build is export-only, so the plain composite base (which just stores the built sub-models) is the correct parent.
Add a from_pretrained override that injects model_type=qwen3_transformer_only for every sub-model, so omitting model_type no longer silently builds the native (full) qwen3 architecture.
Add scripts/export_qwen3_transformer_only.py to export the prefill + decode transformer-only pair in one call, with optional --output-dir copy.
Add tests/unit/models/qwen3/test_transformer_only_composite.py covering the reparent, registry entry, and model_type injection.
# Conflicts:
#   tests/unit/commands/test_build.py
- build.py: keep model-type quant finalizer dispatch alongside main's quantize stage
- quantizer.py: reapply per-target weight/activation symmetry override on top of main's refactored _quantize_qdq (mode-dispatched single-pass quantizer)
- qwen3 transformer-only finalizer: pin mode=static so the new mode-keyed dispatch always routes the fixed w8a16 QDQ scheme (regardless of incoming precision policy), with a regression test
…gistry, non-mutating model_type override
- registry: replace decorator/register API with a plain QUANT_FINALIZERS dict; get_quant_finalizer lazily imports + instantiates (review #1).
- quantizer: resolve+apply the model-type-specific quant policy inside quantize_onnx from config.model_type, a single seam shared by all callers; drop the duplicated dispatch blocks in commands/build.py and build/hf.py (review #2).
- loader: thread an explicit model_type as model_type_override through resolve_task instead of mutating hf_config.model_type, so exporters/patchers keep the architecture's native type while the loader config surfaces the build variant (review #4).
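A plain-dict registry with lazy import and instantiation could look roughly like the sketch below. The names QUANT_FINALIZERS and get_quant_finalizer come from the commit message, but the "module:attribute" string format, the `registry` parameter and the stdlib stand-in entry are assumptions chosen so the example runs without the repo.

```python
import importlib

# Values are "module:attribute" strings, so importing this registry module
# pulls in no heavy dependencies. The entry below is a stand-in that points
# at a standard-library class instead of a real finalizer.
QUANT_FINALIZERS = {
    "demo_variant": "collections:OrderedDict",
}


def get_quant_finalizer(model_type, registry=QUANT_FINALIZERS):
    """Import and instantiate the finalizer for model_type, or return None."""
    target = registry.get(model_type)
    if target is None:
        # Unregistered model types fall back to the caller's default path.
        return None
    module_name, _, attr = target.partition(":")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls()


print(type(get_quant_finalizer("demo_variant")).__name__)
print(get_quant_finalizer("unregistered_type"))
```

Keeping the dispatch behind one lookup function is what lets quantize_onnx serve as the single seam for every caller.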
# Conflicts:
#   src/winml/modelkit/config/build.py
#   src/winml/modelkit/models/hf/__init__.py
@microsoft-github-policy-service agree
Adds a transformer-only ONNX export path for Qwen3 that emits a fused GroupQueryAttention (GQA) op with built-in rotary embedding, RMSNorm expressed as LpNormalization, and 1×1 Conv projections, backed by an FP16 KV cache. The path is opt-in via install(), which hot-patches the build registries to produce two graphs (prefill seq=64, decode seq=1) without embeddings or lm_head. Quantization runs w8a16 static PTQ on these graphs, calibrated on GSM8K.
Results
Produces two transformer-only ONNX files (prefill + decode) plus their w8a16-quantized variants.